Pokémon is a popular game series in which trainers catch and train fictional creatures, also called Pokémon, to battle other trainers. I have chosen to focus this project on Pokémon because the franchise is recognised by almost everyone, yet few people are aware of the depth of data behind these beloved pocket monsters. Because the series has a large competitive player base and receives regular additions, powercreep is a plausible risk. Powercreep is the process by which new content introduced to a game series is progressively more powerful than older content. Newer content is then favoured by players, making older content redundant.
The dataset was retrieved from Kaggle and was published by Mario Tormo Romero. The project uses only the pokedex_(Update_05.20).csv file. The data was collated from pokemondb.net and www.serebii.net. It contains a wealth of information on all known Pokémon species (and their variations) up to the franchise's eighth generation.
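The import step is not shown in this write-up. Below is a minimal sketch of one way to load the file, assuming base R's `read.csv` and that the CSV sits in the working directory. `load_pokedex` and `csv_path` are hypothetical names, not part of the original analysis. Declaring the encoding is worthwhile because the Japanese names contain non-ASCII characters.

```r
# Hypothetical helper (not from the original analysis): read the Kaggle CSV,
# marking strings as UTF-8 so the Japanese names are interpreted correctly.
load_pokedex <- function(path) {
  read.csv(path, encoding = "UTF-8", stringsAsFactors = FALSE)
}

# File name taken from the text; its location is an assumption.
csv_path <- "pokedex_(Update_05.20).csv"
if (file.exists(csv_path)) {
  rawdf <- load_pokedex(csv_path)
  print(dim(rawdf))  # the printout below reports 1,028 rows and 51 columns
}
```

The unnamed index column in the CSV is what `read.csv` labels `X`.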
## # A tibble: 1,028 x 51
## X pokedex_number name german_name japanese_name generation status
## <int> <int> <chr> <chr> <chr> <int> <chr>
## 1 0 1 Bulbasa~ Bisasam "フシギダネ~ 1 Normal
## 2 1 2 Ivysaur Bisaknosp "フシギソウ~ 1 Normal
## 3 2 3 Venusaur Bisaflor "フシギバ~ 1 Normal
## 4 3 3 Mega Ve~ Bisaflor "フシギバ~ 1 Normal
## 5 4 4 Charman~ Glumanda "ヒトカゲ (H~ 1 Normal
## 6 5 5 Charmel~ Glutexo "リザード (L~ 1 Normal
## 7 6 6 Chariza~ Glurak "リザードン~ 1 Normal
## 8 7 6 Mega Ch~ Glurak "リザードン~ 1 Normal
## 9 8 6 Mega Ch~ Glurak "リザードン~ 1 Normal
## 10 9 7 Squirtle Schiggy "ゼニガメ (Z~ 1 Normal
## # ... with 1,018 more rows, and 44 more variables: species <chr>,
## # type_number <int>, type_1 <chr>, type_2 <chr>, height_m <dbl>,
## # weight_kg <dbl>, abilities_number <int>, ability_1 <chr>, ability_2 <chr>,
## # ability_hidden <chr>, total_points <dbl>, hp <dbl>, attack <dbl>,
## # defense <dbl>, sp_attack <dbl>, sp_defense <dbl>, speed <dbl>,
## # catch_rate <dbl>, base_friendship <dbl>, base_experience <dbl>,
## # growth_rate <chr>, egg_type_number <int>, egg_type_1 <chr>,
## # egg_type_2 <chr>, percentage_male <dbl>, egg_cycles <dbl>,
## # against_normal <dbl>, against_fire <dbl>, against_water <dbl>,
## # against_electric <dbl>, against_grass <dbl>, against_ice <dbl>,
## # against_fight <dbl>, against_poison <dbl>, against_ground <dbl>,
## # against_flying <dbl>, against_psychic <dbl>, against_bug <dbl>,
## # against_rock <dbl>, against_ghost <dbl>, against_dragon <dbl>,
## # against_dark <dbl>, against_steel <dbl>, against_fairy <dbl>
As the printout shows, the dataset is large, so the analysis focuses on a subset of the variables. The table below describes the variables used in the analysis. The full codebook can be found in the GitHub repository.
| Variable name | Format | Description |
|---|---|---|
| pokedex_number | numerical | The entry number of the Pokémon in the National Pokédex |
| name | string | The English name of the Pokémon |
| generation (cleaned dataset) | string | The generation in which the Pokémon was first introduced, as a Roman numeral |
| species | string | The category of the Pokémon |
| combined_type (cleaned dataset) | string | The types of the Pokémon in alphabetical order |
| total_points | numerical | Total number of base points |
| hp | numerical | The base HP of the Pokémon |
| attack | numerical | The base Attack of the Pokémon |
| defense | numerical | The base Defence of the Pokémon (renamed defence in the cleaned dataset) |
| sp_attack | numerical | The base Special Attack of the Pokémon |
| sp_defense | numerical | The base Special Defence of the Pokémon (renamed sp_defence in the cleaned dataset) |
| speed | numerical | The base Speed of the Pokémon |
The present visualisations aim to address the following questions:
Luckily there wasn’t too much data wrangling needed on this dataset.
library(dplyr) #required for rename, mutate, rowwise and the pipe
library(tidyr) #required for separate
cleaneddf <- rawdf
##change generation variable from numbers to roman numerals
cleaneddf$generation <- as.character(as.roman(cleaneddf$generation))
##rename defense and sp_defense to the British English spellings
cleaneddf <- cleaneddf %>%
  rename(defence = defense,
         sp_defence = sp_defense)
##removing "Pokémon" from species values
#function to reverse strings by word
reverse_words <- function(string)
{
  # split string by blank spaces
  string_split = strsplit(as.character(string), split = " ")
  # how many split terms?
  string_length = length(string_split[[1]])
  # decide what to do
  if (string_length == 1) {
    # one word (do nothing)
    reversed_string = string_split[[1]]
  } else {
    # more than one word (reverse their order, then collapse them)
    reversed_split = string_split[[1]][string_length:1]
    reversed_string = paste(reversed_split, collapse = " ")
  }
  # output
  return(reversed_string)
}
#reverse word order in species column
#(the column is referenced by name rather than by position, so the code does not break if columns move)
cleaneddf$species <- sapply(cleaneddf$species, reverse_words)
#removing "Pokémon" (now the first word) from species values
cleaneddf <- cleaneddf %>%
  separate(species, c(NA, "species"), sep = " ", extra = "merge", fill = "left")
#reverting word order in species column
cleaneddf$species <- sapply(cleaneddf$species, reverse_words)
##merging type_1 and type_2 columns alphabetically
cleaneddf <- cleaneddf %>%
  rowwise() %>%
  mutate(combined_type = paste(sort(c(type_1, type_2)), collapse = " ")) %>%
  ungroup()
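With the merge above, a Pokémon that has no second type ends up with a leading space in `combined_type` (for example `" Fire"`), presumably because the empty `type_2` value is sorted and pasted along with `type_1`. Here is a minimal base R sketch of one way to avoid that by dropping empty or missing types before pasting. `combine_types` is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: merge two type columns alphabetically,
# ignoring empty strings and NA so single-type values have no stray space.
combine_types <- function(type_1, type_2) {
  mapply(function(a, b) {
    types <- c(a, b)
    types <- types[!is.na(types) & nzchar(types)]
    paste(sort(types), collapse = " ")
  }, type_1, type_2, USE.NAMES = FALSE)
}

# Small illustration with types that appear in the dataset
print(combine_types(c("Poison", "Fire"), c("Grass", "")))

# Possible use on the cleaned data (assumes type_1 and type_2 exist):
# cleaneddf$combined_type <- combine_types(cleaneddf$type_1, cleaneddf$type_2)
```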
NB: the code for the reverse_words function was found at www.gastonsanchez.com.
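If every multi-word value in `species` ends with the word "Pokémon", the reverse, split, reverse sequence above amounts to dropping the last word of each multi-word string. A shorter sketch using a regular expression is offered here as an alternative under that assumption. `drop_last_word` is a hypothetical name.

```r
# Hypothetical helper: remove the final space-separated word.
# Single-word strings are returned unchanged.
drop_last_word <- function(x) {
  sub(" +[^ ]+$", "", x)
}

# ASCII stand-ins for the species values, for illustration only
print(drop_last_word(c("Seed Pokemon", "Tiny Turtle Pokemon", "Lizard")))

# Possible use on the raw column (assumes the values have no trailing spaces):
# cleaneddf$species <- drop_last_word(cleaneddf$species)
```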
Following data cleaning, I subset the data to include only the columns used in the analysis. The table below shows the first 5 rows of the subset.
## # A tibble: 5 x 12
## pokedex_number name generation species combined_type total_points hp
## <int> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 1 Bulbasaur I Seed "Grass Poiso~ 318 45
## 2 2 Ivysaur I Seed "Grass Poiso~ 405 60
## 3 3 Venusaur I Seed "Grass Poiso~ 525 80
## 4 3 Mega Venus~ I Seed "Grass Poiso~ 625 80
## 5 4 Charmander I Lizard " Fire" 309 39
## # ... with 5 more variables: attack <dbl>, defence <dbl>, sp_attack <dbl>,
## # sp_defence <dbl>, speed <dbl>
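The subsetting step itself is not shown. The following is a minimal sketch of how it could be done in base R, using the twelve column names from the table above. `analysis_cols` and `subset_columns` are hypothetical names, and the small data frame only stands in for `cleaneddf`.

```r
# Columns kept for the analysis (names as they appear in the printed subset)
analysis_cols <- c("pokedex_number", "name", "generation", "species",
                   "combined_type", "total_points", "hp", "attack",
                   "defence", "sp_attack", "sp_defence", "speed")

# Hypothetical helper: keep only the requested columns, failing loudly
# if any of them is missing from the data frame.
subset_columns <- function(df, cols = analysis_cols) {
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  df[, cols, drop = FALSE]
}

# Stand-in for cleaneddf with one extra column that should be dropped
toy <- data.frame(
  pokedex_number = c(1L, 4L), name = c("Bulbasaur", "Charmander"),
  german_name = c("Bisasam", "Glumanda"), generation = c("I", "I"),
  species = c("Seed", "Lizard"), combined_type = c("Grass Poison", "Fire"),
  total_points = c(318, 309), hp = c(45, 39), attack = c(49, 52),
  defence = c(49, 43), sp_attack = c(65, 60), sp_defence = c(65, 50),
  speed = c(45, 65), stringsAsFactors = FALSE
)
print(names(subset_columns(toy)))
```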
Summary statistics are presented below:
library(vtable) #required for sumtable function
#variables to include in the tables
variables <- c('total_points', 'hp', 'attack', 'defence', 'sp_attack', 'sp_defence', 'speed')
#table of summary statistics across numerical variables
sumoverall <- sumtable(cleaneddf,
                       #specifying which variables to include in table
                       vars = variables,
                       #specifying which functions to include in the table
                       summ = list(c('median(x)', 'mean(x)', 'sd(x)', 'min(x)', 'max(x)')),
                       #specifying column names
                       summ.names = list(c('Median', 'Mean', 'SD', 'Minimum Value', 'Maximum Value')),
                       #how many decimal places to include
                       digits = 2,
                       #don't include trailing 0s
                       fixed.digits = FALSE,
                       #title of table
                       title = 'Summary statistics')
#print table
sumoverall
| Variable | Median | Mean | SD | Minimum Value | Maximum Value |
|---|---|---|---|---|---|
| total_points | 455 | 437.57 | 121.66 | 175 | 1125 |
| hp | 66.5 | 69.58 | 26.39 | 1 | 255 |
| attack | 76 | 80.12 | 32.37 | 5 | 190 |
| defence | 70 | 74.48 | 31.3 | 5 | 250 |
| sp_attack | 65 | 72.73 | 32.68 | 10 | 194 |
| sp_defence | 70 | 72.13 | 28.08 | 20 | 250 |
| speed | 65 | 68.53 | 29.8 | 5 | 180 |
#table of mean and sd by generation
sumbygroup <- sumtable(cleaneddf,
                       #specify variables
                       vars = variables,
                       #specify functions to include in table
                       summ = list(c('mean(x)', 'sd(x)')),
                       #specify column names
                       summ.names = list(c('Mean', 'SD')),
                       #group by the generation variable
                       group = "generation",
                       #how many decimal places
                       digits = 2,
                       #don't include trailing 0s
                       fixed.digits = FALSE,
                       #title of table
                       title = 'Mean and standard deviation by generation')
#print table
sumbygroup
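| Variable | Gen I Mean | Gen I SD | Gen II Mean | Gen II SD | Gen III Mean | Gen III SD | Gen IV Mean | Gen IV SD | Gen V Mean | Gen V SD | Gen VI Mean | Gen VI SD | Gen VII Mean | Gen VII SD | Gen VIII Mean | Gen VIII SD |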
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| total_points | 424.1 | 112.48 | 419.14 | 119.6 | 436.01 | 135.05 | 459.02 | 119.56 | 435.16 | 107.61 | 442.55 | 118.76 | 459.23 | 123.31 | 438.33 | 141.52 |
| hp | 64.76 | 27.35 | 71.48 | 30.4 | 66.81 | 23.96 | 73.08 | 25.11 | 72.29 | 22.61 | 69.88 | 26.1 | 71.46 | 27.22 | 70.47 | 30.01 |
| attack | 77.3 | 29.61 | 71.87 | 32.6 | 81.03 | 36.3 | 82.87 | 32.78 | 82.98 | 30.99 | 77.19 | 29.83 | 87.31 | 33.87 | 80 | 30.99 |
| defence | 71.01 | 28.59 | 73.82 | 39.19 | 74.05 | 34.73 | 78.13 | 30.15 | 72.27 | 23.03 | 77.02 | 31.2 | 79.2 | 32.39 | 75.12 | 33.59 |
| sp_attack | 69.93 | 33.68 | 66.12 | 27.8 | 75.6 | 35.13 | 76.4 | 31.91 | 70.93 | 32.1 | 75.26 | 32.45 | 77.85 | 35.63 | 71.77 | 28.87 |
| sp_defence | 68.6 | 24.9 | 74.34 | 31.51 | 71.14 | 30.73 | 77.19 | 27.5 | 68.43 | 22.17 | 75.2 | 29.65 | 75.56 | 28.73 | 72.45 | 32.51 |
| speed | 72.51 | 29.86 | 61.51 | 27.31 | 67.38 | 31.03 | 71.34 | 28.48 | 68.26 | 29.16 | 68 | 26.78 | 67.85 | 31.26 | 68.51 | 33.49 |
NB: median values were not included in the summary statistics by generation because they are shown in the visualisation.
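As a cross-check on the grouped table, the same means and standard deviations can be computed without `vtable`. Below is a minimal sketch in base R. `stat_by_generation` is a hypothetical helper, and the toy values are made up for illustration rather than taken from the dataset.

```r
# Hypothetical helper: mean and SD of one statistic per generation,
# rounded to two decimal places like the table above.
stat_by_generation <- function(df, stat) {
  means <- tapply(df[[stat]], df$generation, mean)
  sds   <- tapply(df[[stat]], df$generation, sd)
  data.frame(generation = names(means),
             mean = round(as.numeric(means), 2),
             sd   = round(as.numeric(sds), 2),
             stringsAsFactors = FALSE)
}

# Toy data (illustrative values only)
toy_stats <- data.frame(
  generation   = c("I", "I", "I", "II", "II", "II"),
  total_points = c(300, 400, 500, 310, 410, 540)
)
print(stat_by_generation(toy_stats, "total_points"))

# Possible use on the cleaned data:
# stat_by_generation(cleaneddf, "total_points")
```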
First, let’s visualise whether there is any evidence of powercreep:
library(plotly) #required to build the boxplot
###visualisation 1: total stats across generations
#graph colour palette - each colour was inspired by a game that was released in that generation
gen_colors <- c("#fad61d", #pokemon yellow
                "#b4c5f6", #pokemon silver
                "#5abd8b", #pokemon emerald
                "#bd6ad5", #pokemon pearl
                "#202029", #pokemon black
                "#015f9f", #pokemon x
                "#f59423", #pokemon sun
                "#e5005a") #pokemon shield
#extra information to add to datapoints on hover
text <- ~paste(' Name: ', name,
               '</br> Pokedex No: ', pokedex_number,
               '</br> Type: ', combined_type,
               '</br> Species: ', species)
#graph type
graph_type <- "box"
#show all datapoints alongside each box
boxpoints <- "all"
#width of jitter (numeric, between 0 and 1)
jitter <- 1
#position of points relative to the box (numeric, between -2 and 2)
pointpos <- -2
#graph dimensions in pixels (numeric)
width <- 900
height <- 750
#set up base plot
figtot <- plot_ly(cleaneddf,
                  #set y variable
                  y = ~total_points,
                  #specifying that a different colour should be used for each pokemon generation
                  color = ~generation,
                  #specifying the colour palette
                  colors = gen_colors,
                  #specify graph type
                  type = graph_type,
                  #show datapoints
                  boxpoints = boxpoints,
                  #include jitter
                  jitter = jitter,
                  #jitter to display to the left of each boxplot
                  pointpos = pointpos,
                  #specify graph dimensions
                  width = width,
                  height = height,
                  #adds extra information on hover, x & y values included by default
                  text = text) %>%
  #add layout information
  layout(
    #add title
    title = "Total base statistics across Pokémon generations",
    #add x axis label
    xaxis = list(title = list(text = "Generation")),
    #add y axis label
    yaxis = list(title = list(text = "Total points")),
    #do not show legend
    showlegend = FALSE)
#plot boxplot
figtot
We can see that the median total points were noticeably lower in the first three generations than in the later generations. However, the mean total points have remained relatively stable across generations.
There is also a general trend of the interquartile range (IQR) increasing across the generations; generation III is the notable exception, with a large IQR compared to the others. This suggests that there is more variety in the total base points assigned to newer Pokémon.
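The medians and interquartile ranges read off the boxplot can also be tabulated directly. This minimal sketch assumes base R's `median` and `IQR`, which may use a slightly different quartile rule from the one plotly uses to draw its boxes. `spread_by_generation` is a hypothetical helper and the toy values are illustrative only.

```r
# Hypothetical helper: median and IQR of one statistic per generation,
# keeping generations in their order of first appearance.
spread_by_generation <- function(df, stat = "total_points") {
  gens <- factor(df$generation, levels = unique(df$generation))
  data.frame(
    generation = levels(gens),
    median = as.numeric(tapply(df[[stat]], gens, median)),
    iqr    = as.numeric(tapply(df[[stat]], gens, IQR)),
    stringsAsFactors = FALSE
  )
}

# Toy data (illustrative values only): the second group is more spread out
toy_spread <- data.frame(
  generation   = rep(c("I", "II"), each = 4),
  total_points = c(300, 400, 500, 600, 200, 400, 600, 800)
)
print(spread_by_generation(toy_spread))

# Possible use on the cleaned data:
# spread_by_generation(cleaneddf, "total_points")
```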
What about if we look at individual stats?
###visualisation 2: boxplot showing a breakdown of stats across generations
#colours for traces were inspired by bulbapedia's base stats display
hpcol <- "red"
attcol <- "orange"
defcol <- "darkmagenta" #replaced yellow to make visualisation clearer
specattcol <- "deepskyblue"
specdefcol <- "green"
speedcol <- "hotpink"
fig <- plot_ly(cleaneddf,
               #specify graph type
               type = graph_type,
               #set graph dimensions
               width = width,
               height = height,
               #adds hovertext
               text = text)
#add plots
fig <- fig %>% add_trace(type = graph_type, x = ~hp, y = ~generation, name = "Health points", color = I(hpcol))
fig <- fig %>% add_trace(type = graph_type, x = ~attack, y = ~generation, name = "Attack", color = I(attcol))
fig <- fig %>% add_trace(type = graph_type, x = ~defence, y = ~generation, name = "Defence", color = I(defcol))
fig <- fig %>% add_trace(type = graph_type, x = ~sp_attack, y = ~generation, name = "Special attack", color = I(specattcol))
fig <- fig %>% add_trace(type = graph_type, x = ~sp_defence, y = ~generation, name = "Special defence", color = I(specdefcol))
fig <- fig %>% add_trace(type = graph_type, x = ~speed, y = ~generation, name = "Speed", color = I(speedcol))
fig <- fig %>% layout(
  #grouping traces by generation
  boxmode = "group",
  #add title
  title = "Base statistics across Pokémon generations",
  #reverse axis so Gen 1 shows at the top
  yaxis = list(autorange = "reversed",
               #add y axis label
               title = "Generation"),
  #add x axis label
  xaxis = list(title = list(text = "Points"))
)
#plot boxplot
fig
From the visualisation, we can see that:
Overall, since the fourth generation there has been no evidence of powercreep within the Pokémon games. Rather than adding increasingly powerful Pokémon, the focus may be on adding more specialised Pokémon. This would also explain why there are few trends in the distribution of individual stats across generations. Further evidence for specialisation comes from the fact that, other than Eternatus Eternamax (which is an outlier on total base points), no Pokémon is an outlier on more than two of the base statistics.
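The statement about outliers can be checked by counting, for each Pokémon, the number of base statistics on which it falls outside the usual boxplot fences (more than 1.5 × IQR beyond the quartiles) within its generation. Below is a minimal sketch under that assumption. Plotly's quartile computation may differ slightly from `quantile`, and `is_outlier` and `count_outlier_stats` are hypothetical helpers with illustrative toy data.

```r
# Hypothetical helper: flag values beyond 1.5 * IQR from the quartiles
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Hypothetical helper: per row, the number of statistics on which that row
# is an outlier relative to the other rows in the same group.
count_outlier_stats <- function(df, stats, group = "generation") {
  flags <- sapply(stats, function(s) {
    # ave() applies is_outlier within each group and returns 0/1 values
    ave(df[[s]], df[[group]], FUN = is_outlier) == 1
  })
  rowSums(matrix(flags, nrow = nrow(df)))
}

# Toy data (illustrative values only): the last row is extreme on hp and speed
toy_out <- data.frame(
  generation = rep("I", 6),
  hp     = c(40, 45, 50, 55, 60, 255),
  attack = c(50, 52, 54, 56, 58, 60),
  speed  = c(50, 52, 54, 56, 58, 200)
)
print(count_outlier_stats(toy_out, c("hp", "attack", "speed")))

# Possible use on the cleaned data:
# stats <- c("hp", "attack", "defence", "sp_attack", "sp_defence", "speed")
# table(count_outlier_stats(cleaneddf, stats))
```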
The main limitation of this dataset is that it represents the most up-to-date stats for each Pokémon. Some Pokémon's stats have changed across the generations. For example, in generation one, special attack and special defence were both represented by a single stat called special, which was split in subsequent games. Therefore, this dataset may mask evidence of powercreep. It would be interesting to replicate these visualisations using the stats from when each Pokémon was introduced. Another option with the present dataset would be to compare stats across Pokémon types to see whether one type is superior.
It would also be interesting to compare the popularity of Pokémon with their base statistics. Do people like a specific Pokémon because it is the best, or do other factors such as nostalgia or cuteness come into play? Data for this could come from the Pokémon of the Year survey conducted by Pokémon, or from Google search data.
This file was created using:
Packrat was used for package management.
If you would like to run this code, please download unbundlepackrat.r and assignment-2021-05-20.tar from the GitHub repository. Instructions on how to unbundle assignment-2021-05-20.tar can be found within unbundlepackrat.r.
The full repository for this analysis can be found here.